Give peers which are sending us messages/receiving messages from us longer to respond to ping #1137


Merged: 7 commits into lightningdevkit:main on Oct 28, 2021

Conversation

TheBlueMatt (Collaborator):

This should largely resolve disconnection issues during initial sync, without the larger changes in #1023. It also resolves one `None` unwrap (which may or may not even be present on latest git).

TheBlueMatt added this to the 0.0.103 milestone on Oct 22, 2021.
codecov bot commented on Oct 22, 2021:

Codecov Report

Merging #1137 (28b7430) into main (2bf39a6) will increase coverage by 1.10%.
The diff coverage is 82.03%.

❗ Current head 28b7430 differs from pull request most recent head 0caa8bb. Consider uploading reports for the commit 0caa8bb to get more accurate results

@@            Coverage Diff             @@
##             main    #1137      +/-   ##
==========================================
+ Coverage   90.57%   91.67%   +1.10%     
==========================================
  Files          67       69       +2     
  Lines       34567    43809    +9242     
==========================================
+ Hits        31309    40163    +8854     
- Misses       3258     3646     +388     
Impacted Files Coverage Δ
lightning-background-processor/src/lib.rs 94.68% <0.00%> (+0.45%) ⬆️
lightning/src/ln/peer_handler.rs 57.18% <82.40%> (+11.30%) ⬆️
lightning/src/util/test_utils.rs 86.30% <100.00%> (+4.53%) ⬆️
lightning/src/chain/mod.rs 50.00% <0.00%> (-11.12%) ⬇️
lightning/src/util/macro_logger.rs 86.59% <0.00%> (-1.29%) ⬇️
lightning/src/util/chacha20poly1305rfc.rs 96.77% <0.00%> (-1.19%) ⬇️
lightning/src/chain/chainmonitor.rs 95.35% <0.00%> (-1.01%) ⬇️
lightning/src/util/ser.rs 88.43% <0.00%> (-0.71%) ⬇️
lightning/src/util/byte_utils.rs 100.00% <0.00%> (ø)
lightning-invoice/src/payment.rs 92.83% <0.00%> (ø)
... and 24 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2bf39a6...0caa8bb.

jkczyz self-requested a review on October 25, 2021.
@@ -1148,14 +1182,15 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref, CMH: Deref> P
!peer.should_forward_channel_announcement(msg.contents.short_channel_id) {
continue
}
if peer.pending_outbound_buffer.len() > OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP {
if peer.pending_outbound_buffer.len() > OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP || peer.msgs_sent_since_pong > BUFFER_DRAIN_MSGS_PER_TICK * 2 {
Contributor:

This check seems to be done a lot. Would it be possible to create a method such as `can_queue_outbound_message(&peer)` and skip if it returns false?

Collaborator (author):

Hmm, it's just done in this specific method - we don't limit non-gossip messages in the buffer (as we can't just drop HTLC messages or whatever).
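
(For reference, a sketch of what the suggested helper might look like for this gossip-specific path, with a pared-down `Peer` and illustrative constant values - hypothetical, not the crate's actual code:)

use std::collections::VecDeque;

// Pared-down stand-in for the Peer struct in peer_handler.rs; only the two
// fields this check reads are included here.
struct Peer {
    pending_outbound_buffer: VecDeque<Vec<u8>>,
    msgs_sent_since_pong: usize,
}

// Illustrative values, not the crate's actual constants.
const OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP: usize = 20;
const BUFFER_DRAIN_MSGS_PER_TICK: usize = 32;

/// True if gossip may still be queued for this peer, i.e. the inverse of
/// the drop check in the diff above.
fn can_queue_gossip(peer: &Peer) -> bool {
    peer.pending_outbound_buffer.len() <= OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP
        && peer.msgs_sent_since_pong <= BUFFER_DRAIN_MSGS_PER_TICK * 2
}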

if !peer.channel_encryptor.is_ready_for_encryption() || peer.their_node_id.is_none() {
// The peer needs to complete its handshake before we can exchange messages. We
// give peers one timer tick to complete the handshake.
if peer.awaiting_pong_tick_intervals != 0 {
Contributor:

How can a peer that is not ready for encryption already be owing us pongs?

Collaborator (author):

It's more of a deadline for the peer to complete the initial handshake - we reset `awaiting_pong_tick_intervals` to 0 after the handshake completes. I'll document this better.
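
(Roughly, the shape of the tick logic being described, with the encryptor state reduced to a bool - an illustrative sketch, not the crate's implementation:)

struct Peer {
    // Stand-in for channel_encryptor.is_ready_for_encryption().
    handshake_complete: bool,
    awaiting_pong_tick_intervals: i8,
}

/// Runs once per timer tick; returns true if the peer should be disconnected.
fn timer_tick(peer: &mut Peer) -> bool {
    if !peer.handshake_complete {
        // The counter is 0 on a fresh connection; if it is already nonzero,
        // a full tick has passed without a completed handshake, so we give
        // up on this connection.
        if peer.awaiting_pong_tick_intervals != 0 {
            return true;
        }
        peer.awaiting_pong_tick_intervals = 1;
        return false;
    }
    // Normal ping/pong accounting for handshaked peers would follow here.
    false
}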

peers[1].read_event(&mut fd_b, &fd_a.outbound_data.lock().unwrap().split_off(0)).unwrap();
peers[1].process_events();
peers[0].read_event(&mut fd_a, &fd_b.outbound_data.lock().unwrap().split_off(0)).unwrap();
}

Contributor:

shouldn't we add tests that check precisely the limits of the buffer constants and verify there are no off-by-one errors?

Collaborator (author):

We can't check the message count exactly because we just have bytes, but we can count how many iterations it took and check that.

TheBlueMatt force-pushed the 2021-10-ping-fixes branch 4 times, most recently from 82ab64d to fa04659 on October 25, 2021.
The `cmp::min` appeared to confuse `end` for a count.
TheBlueMatt force-pushed the 2021-10-ping-fixes branch 3 times, most recently from 698ef00 to 8616cf1 on October 26, 2021.
In the coming commits simply calling `timer_tick_occurred` will no
longer disconnect all peers, so it's helpful to have a utility
method.
TheBlueMatt force-pushed the 2021-10-ping-fixes branch 2 times, most recently from cff18d1 to f714dca on October 26, 2021.
valentinewallace (Contributor):

Just successfully opened a channel to Matt with this PR!

TheBlueMatt (Collaborator, author):

Added one more commit that is incredibly useful when grep'ing log entries - ensure we print who an error is from.

@@ -285,14 +285,18 @@ enum InitSyncTracker{
NodesSyncing(PublicKey),
}

/// The ratio between buffer sizes at which we stop sending initial sync messages vs when we stop
/// forwarding messages to peers altogether.
const FORWARD_INIT_SYNC_BUFFER_SIZE_LIMIT: usize = 2;
Contributor:

The naming here is a bit strange. Is it really a limit? Also, why say "INIT_SYNC" here but "GOSSIP" below?

Collaborator (author):

Hmmmm, I think the confusion comes from `OUTBOUND_BUFFER_LIMIT_READ_PAUSE` being used for two things: when the buffer is "full" meaning we stop reading new messages, and when the buffer is "full" meaning we stop putting more initial-gossip-sync messages in the buffer. The `DROP_GOSSIP` limit is for post-initial-sync gossip, with a ratio of init-to-normal forwards. Do you have a concrete suggestion for changing these names?
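
(Sketching the relationship with illustrative numbers, not the crate's actual values:)

// Once the outbound buffer holds this many messages, pause reading from the
// peer and stop queueing initial-sync gossip.
const OUTBOUND_BUFFER_LIMIT_READ_PAUSE: usize = 10;
// Unit-less ratio between the two "full" thresholds described above.
const FORWARD_INIT_SYNC_BUFFER_SIZE_LIMIT: usize = 2;
// Only once the buffer is this full do we drop fresh (post-initial-sync) gossip.
const OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP: usize =
    OUTBOUND_BUFFER_LIMIT_READ_PAUSE * FORWARD_INIT_SYNC_BUFFER_SIZE_LIMIT;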

Contributor:

Not sure. I was thinking something with "factor" in the name, given it is multiplied. My main complaint is that SIZE_LIMIT implies units of "number of messages", but this constant is really unit-less since it is a ratio of two size limits. So an expression like `BUFFER_DRAIN_MSGS_PER_TICK * FACTOR` should have units of "messages per tick", not "messages squared per tick" as the SIZE_LIMIT naming would imply.

Collaborator (author):

Ah! OK, I'd somewhat misunderstood your complaint here - yeah, this definitely needs RATIO or FACTOR in it.

Collaborator (author):

How do the new fixup names sit for you?

Contributor:

LGTM

struct OptionalFromDebugger<'a>(&'a Option<PublicKey>);
impl core::fmt::Display for OptionalFromDebugger<'_> {
fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> Result<(), core::fmt::Error> {
if let Some(node_id) = self.0 { write!(f, " from {}", log_pubkey!(node_id)) } else { Ok(()) }
Contributor:

Perhaps keep the "from" in each message and use "peer" in the else case.

Collaborator (author):

That feels wrong? "from peer" isn't exactly a useful log string. Note that the need to introduce this wrapper comes from the optional public key itself, not the "from".
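
(To make the motivation concrete, a self-contained sketch of the pattern, using derived `Debug` output in place of the crate's `log_pubkey!` macro and a bare byte array in place of a real secp256k1 `PublicKey`:)

use core::fmt;

// Stand-in for secp256k1::PublicKey, for illustration only.
#[derive(Debug)]
struct PublicKey([u8; 33]);

struct OptionalFromDebugger<'a>(&'a Option<PublicKey>);
impl fmt::Display for OptionalFromDebugger<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Print " from <id>" only when the peer's id is known; otherwise
        // print nothing, keeping the surrounding log line grammatical.
        if let Some(node_id) = self.0 { write!(f, " from {:?}", node_id) } else { Ok(()) }
    }
}

fn main() {
    let unknown: Option<PublicKey> = None;
    let known = Some(PublicKey([2; 33]));
    println!("Error handling message{}; ignoring.", OptionalFromDebugger(&unknown));
    println!("Error handling message{}; ignoring.", OptionalFromDebugger(&known));
}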

@@ -979,7 +986,7 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref, CMH: Deref> P
return Err(PeerHandleError{ no_connection_possible: false }.into());
}

log_info!(self.logger, "Received peer Init message: {}", msg.features);
log_info!(self.logger, "Received peer Init message from {}: {}", log_pubkey!(peer.their_node_id.unwrap()), msg.features);
Contributor:

Use the wrapper here?

Collaborator (author):

I mean, the wrapper only exists to solve a very specific lifetime issue, thanks to `format!()` being kinda braindead (it creates references to every argument as-is instead of accepting references as-is).

@@ -493,6 +493,13 @@ impl<Descriptor: SocketDescriptor, RM: Deref, L: Deref> PeerManager<Descriptor,
}
}

struct OptionalFromDebugger<'a>(&'a Option<PublicKey>);
Contributor:

Maybe name this something like MessageSender wrapping a Peer reference?

Collaborator (author):

I added some docs explaining why this was added at all. Given it's not really particularly specific to `Peer`, I dunno if it's worth renaming.

TheBlueMatt force-pushed the 2021-10-ping-fixes branch 2 times, most recently from 7189cc3 to b4dcb56 on October 28, 2021.
arik-so (Contributor) left a comment:

LGTM

ConorOkus:

Successfully opened a channel with Matt & Val as a result of this PR. Exciting!

tACK

This marginally simplifies coming commits.
This ensures we don't let a hung connection stick around forever if
the peer never completes the initial handshake.

This also resolves a race where, on receiving a second connection
from a peer, we may reset their_node_id to None to prevent sending
messages even though the `channel_encryptor`
`is_ready_for_encryption()`. Sending pings only checks the
`channel_encryptor` status, not `their_node_id`, resulting in an
`unwrap` on `None` in `enqueue_message`.
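
(A minimal sketch of the hazard, with the fields reduced to a bool and an Option - illustrative, not the crate's code:)

struct Peer {
    // Stand-in for channel_encryptor.is_ready_for_encryption().
    ready_for_encryption: bool,
    their_node_id: Option<[u8; 33]>,
}

// Pre-fix shape: the ping path gated only on the encryptor...
fn enqueue_ping_buggy(peer: &Peer) {
    if peer.ready_for_encryption {
        // ...but enqueue_message unwraps their_node_id, which a second
        // inbound connection may have reset to None - panic.
        let _id = peer.their_node_id.unwrap();
    }
}

// Post-fix shape: gate sending on both conditions.
fn can_send(peer: &Peer) -> bool {
    peer.ready_for_encryption && peer.their_node_id.is_some()
}
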
TheBlueMatt (Collaborator, author):

Squashed without changes; the only diff since Arik's ACK is to fix compilation. Will land after CI:

$ git diff-tree -U1 b4dcb56 0caa8bb5d
diff --git a/lightning/src/ln/peer_handler.rs b/lightning/src/ln/peer_handler.rs
index 9a1b04853..0d8bc1c98 100644
--- a/lightning/src/ln/peer_handler.rs
+++ b/lightning/src/ln/peer_handler.rs
@@ -1160,3 +1160,3 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref, CMH: Deref> P
 					if peer.pending_outbound_buffer.len() > OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP
-						|| peer.msgs_sent_since_pong > BUFFER_DRAIN_MSGS_PER_TICK * FORWARD_INIT_SYNC_BUFFER_SIZE_LIMIT
+						|| peer.msgs_sent_since_pong > BUFFER_DRAIN_MSGS_PER_TICK * FORWARD_INIT_SYNC_BUFFER_LIMIT_RATIO
 					{
@@ -1185,3 +1185,3 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref, CMH: Deref> P
 					if peer.pending_outbound_buffer.len() > OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP
-						|| peer.msgs_sent_since_pong > BUFFER_DRAIN_MSGS_PER_TICK * FORWARD_INIT_SYNC_BUFFER_SIZE_LIMIT
+						|| peer.msgs_sent_since_pong > BUFFER_DRAIN_MSGS_PER_TICK * FORWARD_INIT_SYNC_BUFFER_LIMIT_RATIO
 					{
@@ -1209,3 +1209,3 @@ impl<Descriptor: SocketDescriptor, CM: Deref, RM: Deref, L: Deref, CMH: Deref> P
 					if peer.pending_outbound_buffer.len() > OUTBOUND_BUFFER_LIMIT_DROP_GOSSIP
-						|| peer.msgs_sent_since_pong > BUFFER_DRAIN_MSGS_PER_TICK * FORWARD_INIT_SYNC_BUFFER_SIZE_LIMIT
+						|| peer.msgs_sent_since_pong > BUFFER_DRAIN_MSGS_PER_TICK * FORWARD_INIT_SYNC_BUFFER_LIMIT_RATIO
 					{
$

TheBlueMatt merged commit 070e22b into lightningdevkit:main on Oct 28, 2021.